Informed kmer selection for de novo transcriptome assembly

Authors

Dilip A. Durai

Marcel H. Schulz

Abstract

MOTIVATION De novo transcriptome assembly is an integral part of many RNA-seq workflows. Common applications include sequencing of non-model organisms, cancer or metatranscriptomes. Most de novo transcriptome assemblers use the de Bruijn graph (DBG) as the underlying data structure. The quality of the assemblies produced by such assemblers is highly influenced by the exact word length k. As such, no single kmer value leads to optimal results. Instead, DBGs over different kmer values are built and the assemblies are merged to improve sensitivity. However, no studies have thoroughly investigated the problem of automatically learning at which kmer value to stop the assembly. Instead, a suboptimal selection of kmer values is often used in practice. RESULTS Here we investigate the contribution of a single kmer value in a multi-kmer-based assembly approach. We find that a comparative clustering of related assemblies can be used to estimate the importance of an additional kmer assembly. Using a model-fit-based algorithm, we predict the kmer value at which no further assemblies are necessary. Our approach is tested with different de novo assemblers for datasets with different coverage values and read lengths. Further, we suggest a simple post-processing step that significantly improves the quality of multi-kmer assemblies. CONCLUSION We provide an automatic method for limiting the number of kmer values without a significant loss in assembly quality but with savings in assembly time. This is a step forward to making multi-kmer methods more reliable and easier to use. AVAILABILITY AND IMPLEMENTATION A general implementation of our approach can be found at https://github.com/SchulzLab/KREATION. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online. CONTACT [email protected].
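The idea of stopping a multi-kmer assembly once additional kmer values stop contributing can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each finished kmer assembly yields a single positive "gain" number (for example, a count of sequence clusters that the new assembly improves), that these gains decay roughly exponentially with k, and that a user-chosen threshold `min_gain` decides when a further assembly is no longer worthwhile. The function names, the log-linear model, and the numbers in the usage lines are all hypothetical.

```python
import math


def fit_log_linear(ks, gains):
    """Least-squares fit of log(gain) = a + b * k; returns (a, b)."""
    ys = [math.log(g) for g in gains]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_y = sum(ys) / n
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
    b = sxy / sxx
    a = mean_y - b * mean_k
    return a, b


def predict_stop_k(ks, gains, min_gain, step, k_max):
    """Return the first future k whose predicted gain is below min_gain.

    ks and gains describe the assemblies already computed (assumed inputs).
    Returns None if the fitted gains do not decay, or if the predicted gain
    stays at or above min_gain for every k up to k_max.
    """
    a, b = fit_log_linear(ks, gains)
    if b >= 0:
        return None  # gains are not shrinking, so no stopping point is predicted
    k = ks[-1] + step
    while k <= k_max:
        if math.exp(a + b * k) < min_gain:
            return k
        k += step
    return None


# Illustrative (made-up) gains observed for three completed assemblies.
observed_ks = [21, 31, 41]
observed_gains = [8000, 4000, 2000]
print(predict_stop_k(observed_ks, observed_gains, min_gain=300, step=10, k_max=101))
```

In this sketch the assemblies for all k below the returned value would still be run, and the returned k is the first one that could be skipped. Refitting after each new assembly, rather than once, would let the prediction adapt as more data points arrive.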


Similar resources

Clustering of Short Read Sequences for de novo Transcriptome Assembly

Given the importance of transcriptome analysis in various biological studies and considering the vast amount of whole transcriptome sequencing data, it seems necessary to develop an algorithm to assemble transcriptome data. In this study we propose an algorithm for transcriptome assembly in the absence of a reference genome. First, the contiguous sequences are generated using de Bruijn graph with d...


Comparisons of De Novo Transcriptome Assemblers in Diploid and Polyploid Species Using Peanut (Arachis spp.) RNA-Seq Data

The narrow genetic base and limited genetic information on Arachis species have hindered the process of marker-assisted selection of peanut cultivars. However, recent developments in sequencing technologies have expanded opportunities to exploit genetic resources, and at lower cost. To use the genetic information for Arachis species available at the transcriptome level, it is important to have ...


The Oyster River Protocol: A Multi Assembler and Kmer Approach For de novo Transcriptome Assembly

Characterizing transcriptomes in non-model organisms has resulted in a massive increase in our understanding of biological phenomena. This boon, largely made possible via high-throughput sequencing, means that studies of functional, evolutionary and population genomics are now being done by hundreds or even thousands of labs around the world. For many, these studies begin with a de novo...


Comparative studies of de novo assembly tools for next-generation sequencing technologies

MOTIVATION Several new de novo assembly tools have been developed recently to assemble short sequencing reads generated by next-generation sequencing platforms. However, the performance of these tools under various conditions has not been fully investigated, and sufficient information is not currently available for informed decisions to be made regarding the tool that would be most likely to pr...


A glance at quality score: implication for de novo transcriptome reconstruction of Illumina reads

Downstream analyses of short-reads from next-generation sequencing platforms are often preceded by a pre-processing step that removes uncalled and wrongly called bases. Standard approaches rely on their associated base quality scores to retain the read or a portion of it when the score is above a predefined threshold. It is difficult to differentiate sequencing error from biological variation w...



Volume 32

Publication date: 2016
